Introduction
The data used in this work are health-insurance records simulated from demographic statistics collected by the United States Census Bureau (USCB), available at “https://www.kaggle.com/mirichoi0218/insurance”.
The variables in this dataset correspond to factors that influence the medical expenses charged by health insurance. For the linear regression, we selected charges as the dependent variable, since it depends on all the others. With this work we therefore aim to assess the influence of each insured individual's age, gender, BMI, number of children, smoking status, and region of residence on the individual medical expenses charged by health insurance.
Variables
Age: age of the beneficiary/insured;
Gender: gender of the insured (male/female);
BMI: body mass index, computed from the ratio of weight to height, which indicates whether weight is relatively high or low for a given height. The ideal range is 18.5 to 24.9 kg/m²;
Children: number of children covered by health insurance (number of dependents);
Smoker: indicator of whether they smoke (yes/no);
Region: region of the US where the beneficiary lives (northeast, southeast, southwest, northwest);
Charges: individual medical expenses charged by health insurance.
Data Importation
# Import the csv file
insurance = read.csv(file='insurance.csv', header=T, sep=';', dec='.')
After importing the file from the aforementioned source, we factorized the age and BMI variables.
For the age variable, three groups were defined: young adult (jovem_adulto), middle age (idade_media), and elderly (idade_avancada).
For the BMI variable, four groups were defined: underweight (baixo_peso), normal weight (peso_normal), overweight (sobrepeso), and obese (obeso).
# Factorization of the age variable with respect to charges
# (relies on the dataset being sorted by age, as imported)
jovem_adulto = insurance$charges[1:444]
idade_media = insurance$charges[445:838]
idade_avancada = insurance$charges[839:1338]
values = c(jovem_adulto, idade_media, idade_avancada)
ind = c(rep('1', length(jovem_adulto)),
        rep('2', length(idade_media)),
        rep('3', length(idade_avancada)))
idade = factor(ind)
# Factorization of the bmi variable with respect to charges
indice_massa = insurance[order(insurance$bmi, insurance$charges), ]
baixo_peso = indice_massa$charges[1:21]
peso_normal = indice_massa$charges[22:245]
sobrepeso = indice_massa$charges[246:631]
obeso = indice_massa$charges[632:1338]
values2 = c(baixo_peso, peso_normal, sobrepeso, obeso)
ind2 = c(rep('1', length(baixo_peso)),
         rep('2', length(peso_normal)),
         rep('3', length(sobrepeso)),
         rep('4', length(obeso)))
bmi = factor(ind2)
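The row-slicing above depends on the sorting order. As a more robust alternative (a sketch, assuming the conventional BMI cut-points of 18.5, 25, and 30 kg/m² implied by the 18.5–24.9 normal range mentioned earlier), the factor can be derived directly from the bmi column with cut():

```r
# Alternative factorization of BMI using assumed standard cut-points;
# avoids relying on the rows being sorted.
bmi_cat <- cut(insurance$bmi,
               breaks = c(-Inf, 18.5, 25, 30, Inf),
               labels = c('baixo_peso', 'peso_normal', 'sobrepeso', 'obeso'),
               right  = FALSE)   # intervals closed on the left: [18.5, 25), ...
table(bmi_cat)                   # group sizes under these cut-points
```
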
# Preliminary information about the dataset
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 18 18 18 18 18 18 18 18 18 18 ...
## $ sex : chr "male" "male" "female" "female" ...
## $ bmi : num 33.8 34.1 26.3 38.7 35.6 ...
## $ children: int 1 0 0 2 0 2 0 0 0 0 ...
## $ smoker : chr "no" "no" "no" "no" ...
## $ region : chr "southeast" "southeast" "northeast" "northeast" ...
## $ charges : num 1726 1137 2198 3393 2211 ...
# Check for missing data
any(is.na(insurance))
## [1] FALSE
The dataset does not contain missing data (NA).
This dataset contains records for 1338 individuals across the 7 variables described above: three continuous, one discrete (number of children), and three categorical. Among the categorical variables, two are binary (gender and smoker) and one has four levels (region). The predictor variables are age, gender, BMI, number of children, smoker, and region.
#install.packages("summarytools")
library(summarytools)
dfSummary(insurance,na.col=F,valid.col=F)
## Data Frame Summary
## insurance
## Dimensions: 1338 x 7
## Duplicates: 1
##
## ---------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph
## ---- ------------- ----------------------------- ---------------------- ---------------------
## 1 age Mean (sd) : 39.2 (14) 47 distinct values :
## [integer] min < med < max: :
## 18 < 39 < 64 : : . : : . : . : .
## IQR (CV) : 24 (0.4) : : : : : : : : : :
## : : : : : : : : : :
##
## 2 sex 1. female 662 (49.5%) IIIIIIIII
## [character] 2. male 676 (50.5%) IIIIIIIIII
##
## 3 bmi Mean (sd) : 30.7 (6.1) 548 distinct values : :
## [numeric] min < med < max: : :
## 16 < 30.4 < 53.1 . : : :
## IQR (CV) : 8.4 (0.2) : : : :
## . : : : : :
##
## 4 children Mean (sd) : 1.1 (1.2) 0 : 574 (42.9%) IIIIIIII
## [integer] min < med < max: 1 : 324 (24.2%) IIII
## 0 < 1 < 5 2 : 240 (17.9%) III
## IQR (CV) : 2 (1.1) 3 : 157 (11.7%) II
## 4 : 25 ( 1.9%)
## 5 : 18 ( 1.3%)
##
## 5 smoker 1. no 1064 (79.5%) IIIIIIIIIIIIIII
## [character] 2. yes 274 (20.5%) IIII
##
## 6 region 1. northeast 324 (24.2%) IIII
## [character] 2. northwest 325 (24.3%) IIII
## 3. southeast 364 (27.2%) IIIII
## 4. southwest 325 (24.3%) IIII
##
## 7 charges Mean (sd) : 13270.4 (12110) 1337 distinct values :
## [numeric] min < med < max: : .
## 1121.9 < 9382 < 63770.4 : :
## IQR (CV) : 11899.6 (0.9) : :
## : : : : . . . .
## ---------------------------------------------------------------------------------------------
The descriptive statistics are presented above.
var(insurance$age)
## [1] 197.4014
sd(insurance$age)
## [1] 14.04996
# Boxplot of ages by age group (insurance is sorted by age, matching the idade factor)
boxplot(insurance$age ~ idade, names = c('jovem_adulto','idade_media','idade_avançada'), xlab = 'Faixas etárias', ylab = 'Idades', col = c('lightpink','lightgreen','lightblue'), main = 'Distribuição das idades por faixas etárias')
genero = table(insurance$sex)
barplot (genero, ylim = c (0,700), names.arg = c('Feminino','Masculino'), col = c('lightpink','lightblue'), ylab = 'Número de pessoas', main = 'Número de pessoas por género' )
var(insurance$bmi)
## [1] 37.18788
sd(insurance$bmi)
## [1] 6.098187
# Boxplot of BMI values by BMI group (indice_massa is sorted by bmi, matching the bmi factor)
boxplot(indice_massa$bmi ~ bmi, names = c('baixo_peso','peso_normal','sobrepeso','obeso'), xlab = 'Faixas de BMI', ylab = 'Valores de BMI', col = c('lightpink','lightgreen','lightblue','lightyellow'), main = 'Distribuição das faixas de BMI por índice de massa corporal')
var(insurance$children)
## [1] 1.453213
sd(insurance$children)
## [1] 1.205493
boxplot(insurance$children, ylab = 'Número de filhos', main = 'Distribuição de filhos na população', col = 'lightblue')
color <- colorRampPalette(c("darkblue","lightblue"))
barplot(table (insurance$children), ylim = c(0,700), xlab = 'Número de filhos', ylab = 'Número de pessoas', main = 'Distribuição de filhos na população', col = color (6))
barplot (table(insurance$smoker), ylim = c(0,1200), names = c('Não fumador','Fumador'), main = 'Distribuição de fumadores na população',col =c('lightgreen','lightblue'), ylab = 'Número de pessoas')
regiao = table (insurance$region)
barplot (regiao, ylim = c(0,400), main = 'Distribuição das regiões',col =c('lightgreen','lightblue','lightpink','lightyellow'),ylab = 'Número de pessoas')
var(insurance$charges)
## [1] 146652372
sd(insurance$charges)
## [1] 12110.01
boxplot (insurance$charges, ylim= c(0,65000), main = 'Distribuição dos taxas médicos',ylab = 'Preço ($)',col = 'lightyellow')
Since the dependent variable under study is “charges”, it was analysed against each of the independent variables (age, gender, BMI, number of children, smoker, and region).
# Boxplot of charges by age group
boxplot(values~idade,names = c('jovem_adulto','idade_media','idade_avançada'), xlab = 'Faixas etárias', ylab = 'Taxas', col = c('lightpink','lightgreen','lightblue'),main = 'Distribuição das taxas por faixas etárias')
# Fligner-Killeen test to check the homogeneity of variances
fligner.test(insurance$charges~idade)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: insurance$charges by idade
## Fligner-Killeen:med chi-squared = 7.3869, df = 2, p-value = 0.02489
oneway.test(insurance$charges~idade,var.equal=F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: insurance$charges and idade
## F = 54.7, num df = 2.00, denom df = 866.79, p-value < 2.2e-16
model= aov(insurance$charges~idade)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## idade 2 1.453e+10 7.267e+09 53.44 <2e-16 ***
## Residuals 1335 1.815e+11 1.360e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = insurance$charges ~ idade)
##
## $idade
## diff lwr upr p adj
## 2-1 3249.904 1356.188 5143.619 0.0001765
## 3-1 7802.877 6018.684 9587.070 0.0000000
## 3-2 4552.973 2709.792 6396.154 0.0000000
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of homogeneity of variances; the Welch one-way test (which does not assume equal variances) was therefore used.
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of equal means, so there are significant differences between groups. We therefore proceed with Tukey's Honestly Significant Difference (TukeyHSD) multiple-comparison test.
Outliers were observed for all groups in the boxplot of charges by age group. The group with the highest mean charges is the elderly. The ANOVA indicates significant differences between middle age and young adult, between elderly and young adult, and between elderly and middle age.
boxplot(charges~sex, data=insurance, col = c('lightpink','lightgreen'), main = 'Distribuição das taxas por géneros', xlab = 'Género', ylab = 'Taxas', names = c('Feminino','Masculino'))
# Fligner-Killeen test to check the homogeneity of variances
fligner.test(insurance$charges~insurance$sex)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: insurance$charges by insurance$sex
## Fligner-Killeen:med chi-squared = 9.4445, df = 1, p-value = 0.002118
oneway.test(insurance$charges~insurance$sex,var.equal=F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: insurance$charges and insurance$sex
## F = 4.4137, num df = 1.0, denom df = 1313.4, p-value = 0.03584
model= aov(insurance$charges~insurance$sex)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## insurance$sex 1 6.436e+08 643590180 4.4 0.0361 *
## Residuals 1336 1.954e+11 146280413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = insurance$charges ~ insurance$sex)
##
## $`insurance$sex`
## diff lwr upr p adj
## male-female 1387.172 89.81229 2684.532 0.0361327
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of homogeneity of variances; the Welch one-way test was therefore used.
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of equal means, so there are significant differences. We therefore proceed with Tukey's HSD multiple-comparison test.
The boxplot of charges by gender shows outliers for both levels. The ANOVA reveals a significant difference in mean charges between females and males.
# Boxplot da distribuição das taxas por Índice de massa corporal
boxplot(values2~bmi, names = c('baixo_peso', 'peso_normal', 'sobrepeso', 'obeso'), xlab = 'Índice de massa Corporal', ylab = 'Taxas', col = c('lightpink','lightgreen','lightblue', 'lightyellow'),main = 'Distribuição das taxas por índice de massa corporal')
# Fligner-Killeen test to check the homogeneity of variances
fligner.test(indice_massa$charges~bmi)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: indice_massa$charges by bmi
## Fligner-Killeen:med chi-squared = 34.434, df = 3, p-value = 1.604e-07
oneway.test(indice_massa$charges~bmi,var.equal=F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: indice_massa$charges and bmi
## F = 20.341, num df = 3.000, denom df = 97.562, p-value = 2.556e-10
model= aov(indice_massa$charges~bmi)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## bmi 3 7.941e+09 2.647e+09 18.77 6.3e-12 ***
## Residuals 1334 1.881e+11 1.410e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = indice_massa$charges ~ bmi)
##
## $bmi
## diff lwr upr p adj
## 2-1 1776.9104 -5194.5001 8748.321 0.9135832
## 3-1 2329.8892 -4514.9803 9174.759 0.8175344
## 4-1 6894.7148 130.4966 13658.933 0.0437917
## 3-2 552.9788 -2012.7959 3118.754 0.9453699
## 4-2 5117.8044 2775.6668 7459.942 0.0000001
## 4-3 4564.8256 2631.6202 6498.031 0.0000000
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of homogeneity of variances; the Welch one-way test was therefore used.
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of equal means, so there are significant differences. We therefore proceed with Tukey's HSD multiple-comparison test.
In the boxplot of charges by BMI, outliers are present only for the obese group, which also has the highest mean charges. The Tukey results show significant differences only between the obese group and each of the other groups (underweight, normal weight, and overweight); the remaining pairs do not differ.
boxplot(charges~children, data=insurance, col = color (6), main = 'Distribuição das taxas por número de filhos', xlab = 'Número de filhos', ylab = 'Taxas')
# Fligner-Killeen test to check the homogeneity of variances
children = factor (insurance$children)
fligner.test(insurance$charges~children)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: insurance$charges by children
## Fligner-Killeen:med chi-squared = 22.198, df = 5, p-value = 0.0004801
oneway.test(insurance$charges~children,var.equal=F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: insurance$charges and children
## F = 6.8999, num df = 5.00, denom df = 122.04, p-value = 1.051e-05
model= aov(insurance$charges~children)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## children 5 2.397e+09 479383343 3.297 0.00579 **
## Residuals 1332 1.937e+11 145403382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = insurance$charges ~ children)
##
## $children
## diff lwr upr p adj
## 1-0 365.1962 -2026.08480 2756.477 0.9980122
## 2-0 2707.5881 62.30881 5352.867 0.0412937
## 3-0 2989.3428 -110.03004 6088.716 0.0660149
## 4-0 1484.6807 -5546.17786 8515.539 0.9908612
## 5-0 -3579.9404 -11817.32899 4657.448 0.8169344
## 2-1 2342.3919 -588.38208 5273.166 0.2025346
## 3-1 2624.1465 -722.20151 5970.495 0.2210585
## 4-1 1119.4845 -6023.68749 8262.656 0.9977503
## 5-1 -3945.1366 -12278.59355 4388.320 0.7562079
## 3-2 281.7546 -3250.57080 3814.080 0.9999166
## 4-2 -1222.9074 -8455.07055 6009.256 0.9967694
## 5-2 -6287.5285 -14697.39071 2122.334 0.2705133
## 4-3 -1504.6621 -8914.97868 5905.655 0.9923749
## 5-3 -6569.2831 -15132.83330 1994.267 0.2433597
## 5-4 -5064.6211 -15702.34884 5573.107 0.7517252
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of homogeneity of variances; the Welch one-way test was therefore used.
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of equal means, so there are significant differences. We therefore proceed with Tukey's HSD multiple-comparison test.
In the boxplot of charges by number of children, outliers are present. The group with 3 children has the highest mean charges (it shows the largest positive Tukey difference relative to 0 children). The Tukey results show a significant difference only between 0 and 2 children.
boxplot(charges~smoker, data=insurance, col = c('lightpink','lightgreen'), main = 'Distribuição das taxas por fumadores', xlab = 'Fumadores', ylab = 'Taxas', names = c('Não','Sim'))
# Fligner-Killeen test to check the homogeneity of variances
fligner.test(insurance$charges~insurance$smoker)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: insurance$charges by insurance$smoker
## Fligner-Killeen:med chi-squared = 238.15, df = 1, p-value < 2.2e-16
oneway.test(insurance$charges~insurance$smoker,var.equal = F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: insurance$charges and insurance$smoker
## F = 1072.7, num df = 1.00, denom df = 311.85, p-value < 2.2e-16
oneway.test(insurance$charges~insurance$smoker,var.equal = T)
##
## One-way analysis of means
##
## data: insurance$charges and insurance$smoker
## F = 2177.6, num df = 1, denom df = 1336, p-value < 2.2e-16
model= aov(insurance$charges~insurance$smoker)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## insurance$smoker 1 1.215e+11 1.215e+11 2178 <2e-16 ***
## Residuals 1336 7.455e+10 5.580e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = insurance$charges ~ insurance$smoker)
##
## $`insurance$smoker`
## diff lwr upr p adj
## yes-no 23615.96 22623.17 24608.75 0
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of homogeneity of variances; the Welch one-way test was therefore used.
For an alpha of 0.05, since the p-value is below alpha, we reject the null hypothesis of equal means, so there are significant differences. We therefore proceed with Tukey's HSD multiple-comparison test.
In the boxplot of charges by smoker, outliers are present only for “no”, which also has the lowest mean charges. The analysis shows a highly significant difference in charges between smokers and non-smokers.
boxplot(charges~region, data=insurance, col = c('lightpink','lightgreen', 'lightblue', 'lightyellow'), main = 'Distribuição das taxas por regiões', xlab = 'Região', ylab = 'Taxas')
# Fligner-Killeen test to check the homogeneity of variances
fligner.test(insurance$charges~insurance$region)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: insurance$charges by insurance$region
## Fligner-Killeen:med chi-squared = 19.233, df = 3, p-value = 0.0002447
oneway.test(insurance$charges~insurance$region,var.equal=F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: insurance$charges and insurance$region
## F = 2.6081, num df = 3.00, denom df = 740.96, p-value = 0.05059
model= aov(insurance$charges~insurance$region)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## insurance$region 3 1.301e+09 433586560 2.97 0.0309 *
## Residuals 1334 1.948e+11 146007093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = insurance$charges ~ insurance$region)
##
## $`insurance$region`
## diff lwr upr p adj
## northwest-northeast -988.8091 -3428.93434 1451.31605 0.7245243
## southeast-northeast 1329.0269 -1044.94167 3702.99551 0.4745046
## southwest-northeast -1059.4471 -3499.57234 1380.67806 0.6792086
## southeast-northwest 2317.8361 -54.19944 4689.87157 0.0582938
## southwest-northwest -70.6380 -2508.88256 2367.60656 0.9998516
## southwest-southeast -2388.4741 -4760.50957 -16.43855 0.0476896
For an alpha of 0.05, since the Fligner-Killeen p-value is below alpha, we reject the null hypothesis of homogeneity of variances.
Here the Welch test p-value (0.051) is marginally above alpha, while the classical ANOVA p-value (0.031) is below it, so the evidence for differences between regions is borderline; we nevertheless proceed with Tukey's HSD multiple-comparison test.
Regarding the region variable, outliers are present in all groups. The group with the highest mean charges is the southeast (its Tukey differences against the other regions are positive). The Tukey results show a significant difference only between southwest and southeast; the remaining pairs do not differ.
The correlation between the quantitative (here also continuous) variables was computed using the Pearson correlation coefficient.
# Create a dataset with only the numeric variables
insurance_num <- insurance
insurance_num$sex<- NULL
insurance_num$region<- NULL
insurance_num$smoker<- NULL
# Correlations
#install.packages('ggplot2')
library (ggplot2)
library (GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(insurance_num, geom="tile", label= T, label_alpha=F, label_round=2)
Only correlations above 0.5 were considered relevant for this case study, taking them as strong positive correlations (1 being a perfect linear relationship).
Since none of the correlations exceeds 0.5, none of the variables is strongly correlated with the insurance charges. However, “charges” has a weak positive correlation with “age”, “bmi”, and “children”, the strongest being with “age” (0.3), which is logical and expected.
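The same values can be read off numerically with base R's cor(), which computes Pearson correlations by default:

```r
# Pearson correlation matrix of the numeric variables, rounded for readability
round(cor(insurance_num), 2)
```
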
full.model<- lm(charges~., data=insurance)
summary(full.model)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
# Test for multicollinearity
car::vif(full.model)
## GVIF Df GVIF^(1/(2*Df))
## age 1.016822 1 1.008376
## sex 1.008900 1 1.004440
## bmi 1.106630 1 1.051965
## children 1.004011 1 1.002003
## smoker 1.012074 1 1.006019
## region 1.098893 3 1.015841
Diagnostic measures: the F-statistic is large and its p-value very small, indicating that at least one predictor variable is related to charges.
R²: the closer to 1, the better the predictors explain the response.
RSE: the closer to 0, the smaller the model's error.
The RSE is 6062 and R² is 75%. The intercept is -11938.5, and all predictor variables except gender and the northwest region are significant according to their p-values. A categorical coefficient such as “smoker” reads as “average charges increase by 23848.5 if the individual is a smoker, all other variables held constant”. A significant coefficient is the average change in charges for a one-unit increase in the predictor, with the others held constant. While correlation measures the strength of a relationship, the coefficient quantifies it and allows predictions from an equation: for each year added to age, the expected charge is 256.9 higher, after controlling for the other variables.
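This coefficient reading can be checked with predict(); the profile below is a hypothetical individual invented purely for illustration.

```r
# Hypothetical insured person (all values assumed, for illustration only)
novo <- data.frame(age = 30, sex = 'female', bmi = 28,
                   children = 1, smoker = 'no', region = 'northwest')
predict(full.model, newdata = novo)
# In a linear model, adding one year of age shifts the prediction
# by exactly the age coefficient, all else held constant.
predict(full.model, newdata = transform(novo, age = 31)) -
  predict(full.model, newdata = novo)
```
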
library (MASS)
step.model <- stepAIC(full.model, direction="both", trace=FALSE)
summary(step.model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
AIC(step.model)
## [1] 27113.66
step.modelf <- stepAIC(full.model, direction='forward', trace=FALSE)
summary(step.modelf)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
AIC(step.modelf)
## [1] 27115.51
The Residuals vs Fitted plot relates the residuals to the fitted values and checks the linearity assumption.
The Normal Q-Q plot visually assesses whether the residuals follow a normal distribution; a Shapiro-Wilk test is performed later to verify normality.
The Scale-Location plot checks the homogeneity of the residuals' variance (the constant-variance assumption).
The Residuals vs Leverage plot identifies the observations that most influence the regression.
step.modelb <- stepAIC(full.model, direction='backward', trace=FALSE)
summary(step.modelb)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
AIC(step.modelb)
## [1] 27113.66
plot(step.modelb)
In the Residuals vs Fitted plot, the non-horizontal red line may indicate a non-linear relationship.
In the Normal Q-Q plot, the residuals deviate from the straight line, indicating that they are not normally distributed.
In the Scale-Location plot, the non-horizontal line indicates heteroscedasticity (the variance of the residuals is not constant).
These violated assumptions indicate that this model does not support fully reliable conclusions.
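The heteroscedasticity seen in the Scale-Location plot can also be checked formally. A minimal Breusch-Pagan-style sketch in base R (the auxiliary regression on the fitted values is a simplification of the full test, which regresses on the predictors):

```r
# Breusch-Pagan-style check: regress squared residuals on the fitted values;
# under homoscedasticity, n * R^2 of this auxiliary fit is ~ chi-squared(1).
bp_sketch <- function(model) {
  e2   <- residuals(model)^2
  aux  <- lm(e2 ~ fitted(model))                 # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared
  c(statistic = stat,
    p.value   = pchisq(stat, df = 1, lower.tail = FALSE))
}
bp_sketch(step.modelb)   # a small p-value is evidence of heteroscedasticity
```
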
Finally, stepwise selection was performed from the full model fit, using the ‘backward’, ‘forward’, and ‘both’ directions. Comparing the AIC values of the three methods, the ‘backward’ method was chosen, as it has the lowest AIC (tied with ‘both’, which selects the same model).
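Base R's AIC() accepts several fitted models at once, so this comparison can be made in a single call:

```r
# Side-by-side AIC of the three stepwise fits (lower is better)
AIC(step.model, step.modelf, step.modelb)
```
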
anova(full.model)
## Analysis of Variance Table
##
## Response: charges
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 1.7530e+10 1.7530e+10 477.0239 < 2.2e-16 ***
## sex 1 7.9167e+08 7.9167e+08 21.5425 3.803e-06 ***
## bmi 1 5.2576e+09 5.2576e+09 143.0687 < 2.2e-16 ***
## children 1 5.5111e+08 5.5111e+08 14.9966 0.0001129 ***
## smoker 1 1.2287e+11 1.2287e+11 3343.5022 < 2.2e-16 ***
## region 3 2.3343e+08 7.7810e+07 2.1173 0.0962211 .
## Residuals 1329 4.8840e+10 3.6749e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step.modelf)
## Analysis of Variance Table
##
## Response: charges
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 1.7530e+10 1.7530e+10 477.0239 < 2.2e-16 ***
## sex 1 7.9167e+08 7.9167e+08 21.5425 3.803e-06 ***
## bmi 1 5.2576e+09 5.2576e+09 143.0687 < 2.2e-16 ***
## children 1 5.5111e+08 5.5111e+08 14.9966 0.0001129 ***
## smoker 1 1.2287e+11 1.2287e+11 3343.5022 < 2.2e-16 ***
## region 3 2.3343e+08 7.7810e+07 2.1173 0.0962211 .
## Residuals 1329 4.8840e+10 3.6749e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(step.modelb)
## Analysis of Variance Table
##
## Response: charges
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 1.7530e+10 1.7530e+10 477.3270 < 2.2e-16 ***
## bmi 1 5.4464e+09 5.4464e+09 148.3005 < 2.2e-16 ***
## children 1 5.7152e+08 5.7152e+08 15.5618 8.402e-05 ***
## smoker 1 1.2345e+11 1.2345e+11 3361.3366 < 2.2e-16 ***
## region 3 2.3320e+08 7.7734e+07 2.1166 0.09631 .
## Residuals 1330 4.8845e+10 3.6726e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Cook's distance, to check for influential points in the step.modelb regression
n=1338
cook = cooks.distance(step.modelb)
pontInf = which(cook>4/n) # points with Cook's distance above the 4/n threshold
pontInf
## 7 15 51 53 67 98 105 127 131 148 173 176 189 216 240 249
## 7 15 51 53 67 98 105 127 131 148 173 176 189 216 240 249
## 254 256 266 270 297 313 316 343 344 358 364 367 373 374 385 407
## 254 256 266 270 297 313 316 343 344 358 364 367 373 374 385 407
## 410 417 435 459 483 497 498 513 517 527 536 561 562 615 626 669
## 410 417 435 459 483 497 498 513 517 527 536 561 562 615 626 669
## 681 683 695 768 773 793 794 803 811 830 838 849 880 914 946 947
## 681 683 695 768 773 793 794 803 811 830 838 849 880 914 946 947
## 962 963 1015 1017 1026 1032 1036 1049 1051 1056 1063 1071 1080 1087 1103 1121
## 962 963 1015 1017 1026 1032 1036 1049 1051 1056 1063 1071 1080 1087 1103 1121
## 1126 1160 1223 1225 1229 1247 1257 1261 1266 1275 1280 1317
## 1126 1160 1223 1225 1229 1247 1257 1261 1266 1275 1280 1317
summary(cook)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000e-09 2.070e-05 1.120e-04 8.514e-04 1.024e-03 2.029e-02
Cook's distance is one method for identifying influential points in a regression. Observations whose distance exceeds 4/n, where n is the sample size, are flagged as potentially influential and should be examined.
Cook's distance was used to check for points that may influence the regression; here the threshold 4/n is about 0.003. Although the distances are on average below 0.003, the maximum (0.020) is well above the reference value, showing that at least one observation influences the results.
n= 1338
cook2 = cooks.distance(step.modelf)
pontInf = which(cook2>4/n) # points with Cook's distance above the 4/n threshold
summary(cook2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.098e-05 1.138e-04 8.415e-04 1.014e-03 1.986e-02
# leverage: observations with hj > 2(p+1)/n may be considered outliers
n = 1338
p = 7
limhj = (2*(p+1))/n
limhj
## [1] 0.01195815
Observations with leverage hj above this threshold (0.012) are treated as outliers in the regression. Since such observations exist, they may also be influencing other results, such as the correlations.
# High-leverage outliers for step.modelb
hj = hatvalues(step.modelb)
out = which(hj>limhj)
out
## 67 72 142 156 176 216 285 387 507 615 669 670 781 816 1023
## 67 72 142 156 176 216 285 387 507 615 669 670 781 816 1023
summary(hj)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003061 0.004523 0.005545 0.005979 0.007062 0.017189
These results identify the high-leverage observations that may be influencing the regression in step.modelb.
# High-leverage outliers for step.modelf
hj2=hatvalues(step.modelf)
out=which(hj2>limhj)
out
## 67 72 142 156 176 216 244 285 387 448 507 514 577 615 664 665
## 67 72 142 156 176 216 244 285 387 448 507 514 577 615 664 665
## 669 670 721 781 816 846 861 1023 1080 1330
## 669 670 721 781 816 846 861 1023 1080 1330
summary(hj2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003824 0.005261 0.006290 0.006726 0.007850 0.018123
These results identify the high-leverage observations that may be influencing the regression in step.modelf.
shapiro.test(residuals(step.modelb))
##
## Shapiro-Wilk normality test
##
## data: residuals(step.modelb)
## W = 0.89909, p-value < 2.2e-16
t.test(residuals(step.modelb))
##
## One Sample t-test
##
## data: residuals(step.modelb)
## t = 1.0883e-15, df = 1337, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -324.1596 324.1596
## sample estimates:
## mean of x
## 1.798269e-13
sigma(step.modelb)/mean(insurance$charges)
## [1] 0.456668
This ratio is the residual standard error expressed as a fraction of the mean charge: the typical prediction error is roughly 46% of the average charge, indicating considerable unexplained variability.
plot(step.modelb)
plot(step.modelb$fitted.values, step.modelb$residuals, ylab="residuals", xlab="fitted values")
abline(h=0,lty=2,col=2)
Taking alpha = 0.05: for the Shapiro-Wilk test, the p-value (< 2.2e-16) is below alpha, so we reject the null hypothesis of normality of the residuals. For the one-sample t-test, the p-value of 1 gives no evidence against a zero mean, so we cannot reject that the residual mean is 0, which is consistent with one of the assumptions of ANOVA.
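The decision rules applied here can be written as a small self-contained check (alpha = 0.05 is our assumed significance level; the p-values are the ones reported above):

```r
# Decision rules for the two tests above (alpha assumed to be 0.05)
alpha <- 0.05
shapiro_p <- 2.2e-16   # Shapiro-Wilk p-value reported above
ttest_p   <- 1         # one-sample t-test p-value reported above
reject_normality <- shapiro_p < alpha   # TRUE: reject normality of residuals
reject_zero_mean <- ttest_p < alpha     # FALSE: a zero mean cannot be rejected
```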
boxplot(step.modelb$residuals,col= 'lightgreen', ylim = c(-15000,35000))
boxplot.stats(step.modelb$residuals)
## $stats
## 626 694 735 495 838
## -9092.9812 -2837.7950 -979.7228 1361.9517 7641.1161
##
## $n
## [1] 1338
##
## $conf
## [1] -1161.1290 -798.3166
##
## $out
## 7 9 10 15 18 28 39
## 18510.707 -9479.257 8232.497 -10446.745 9568.061 8407.782 7943.774
## 41 51 53 63 66 93 98
## 9050.296 -10026.986 11957.905 -9523.572 8574.914 -9273.671 19940.302
## 105 126 127 129 130 131 136
## 20210.750 -9215.459 19770.789 9519.172 8829.309 15416.055 7927.626
## 138 148 165 173 179 189 216
## -9514.438 23176.632 8025.919 13709.305 -9757.760 20948.264 9754.756
## 240 244 245 248 249 254 256
## 21689.333 8195.297 12058.754 10683.320 19322.338 24078.626 16390.355
## 266 269 270 276 283 289 292
## 9663.772 9237.581 -10390.904 -9467.341 8504.305 9449.515 -10272.180
## 293 297 313 316 334 336 343
## 13428.724 19396.736 18044.334 -10844.171 -9613.907 -9554.128 -10807.594
## 344 352 354 357 358 360 364
## 12489.813 -9375.536 -9773.335 11014.077 -10223.873 12509.867 20310.684
## 367 373 374 385 394 398 399
## -9888.664 15669.645 19373.048 13366.889 12028.768 14303.257 12090.666
## 406 407 410 417 421 433 435
## -10482.763 13274.838 -10913.229 -10582.782 -9597.055 7762.998 -10023.764
## 439 442 444 459 465 466 483
## -10576.323 10653.680 -9871.992 25382.867 -9848.007 -9523.075 13943.730
## 485 486 497 498 510 511 513
## -9587.265 -9944.655 -9666.040 18157.414 -9978.286 -10818.821 23128.671
## 517 527 528 536 542 554 561
## -10965.789 22087.079 -10093.626 13007.301 -10122.613 -10199.702 14060.678
## 562 568 581 586 609 613 615
## -10137.637 -9744.146 -10132.397 -10289.095 10475.040 -9721.196 8649.263
## 620 623 626 652 654 656 669
## 13770.020 -10186.331 -11367.181 -10061.132 -9885.241 -10662.811 -10455.761
## 672 673 681 683 684 695 729
## 14068.079 -10015.062 -10120.609 -11091.218 -9791.223 16038.188 -10588.767
## 730 731 736 752 754 762 764
## -9946.512 -10433.716 -9695.454 -10661.519 7670.791 8355.364 7873.530
## 768 769 773 774 778 779 793
## -10835.506 -9874.389 -10860.249 -9977.042 -9449.399 -10157.174 -10885.504
## 794 797 803 811 830 832 838
## 13877.051 8269.353 21772.769 -10661.965 19281.879 8623.973 29935.533
## 844 849 863 872 876 880 884
## 7752.522 16384.764 -10123.028 -9990.787 9498.524 18472.082 -9249.045
## 886 889 892 895 897 900 911
## -9844.663 -9635.508 -9170.831 -9347.425 -9610.292 -10192.917 13411.785
## 914 916 921 936 938 946 947
## 15583.240 11822.622 -9143.649 -9978.541 -9790.762 -9829.981 16845.921
## 954 962 963 972 989 1004 1005
## 9076.598 20239.054 15532.302 12651.269 -10034.181 8067.725 -9978.441
## 1015 1017 1020 1026 1028 1030 1032
## -9998.844 15290.379 -9728.717 18794.274 12046.785 11216.939 18360.795
## 1034 1036 1037 1041 1049 1051 1056
## -9615.356 22062.375 7990.242 -9409.694 -9673.152 16133.936 15669.152
## 1063 1071 1074 1076 1078 1080 1087
## -10313.543 14243.689 12775.645 7664.944 -9305.114 23026.094 14533.842
## 1103 1107 1121 1126 1132 1157 1160
## 24425.453 8496.639 14076.049 -10580.262 -9290.259 8473.845 11973.828
## 1180 1208 1209 1222 1223 1225 1227
## 8154.065 7963.747 13308.487 8589.906 21964.149 17096.080 8355.571
## 1229 1243 1244 1247 1252 1255 1257
## 17147.735 7789.203 8585.316 15177.558 8199.265 7967.535 13368.624
## 1261 1266 1268 1275 1280 1296 1298
## 11566.051 20743.565 8073.743 12964.297 14720.164 8170.784 8980.802
## 1301 1317 1318 1320 1326 1337
## 8375.179 17223.178 8408.754 8673.634 8157.324 8853.505
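The $out entries listed above follow the standard boxplot rule: points lying more than 1.5 times the interquartile range beyond the quartiles are reported as outliers. A minimal self-contained sketch (synthetic data):

```r
# Sketch of the outlier rule used by boxplot.stats (synthetic data)
set.seed(3)
v <- c(rnorm(100), 8)             # plant one extreme value
bs <- boxplot.stats(v)
is_out <- 8 %in% bs$out           # the planted value is reported as an outlier
```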
In conclusion, we can observe that:
Regarding the exploratory analysis, the mean and median of the age variable are very similar, and ages are fairly evenly distributed, with no outliers detected. The youngest patient is 18 years old and the oldest is 64.
There are slightly more men than women, with this difference being around 1%.
For BMI, the distribution is asymmetric, with patients spread across the different categories. The mean and median are almost identical. Outliers are detected in all groups, most prominently in the obese group, as expected since it contains the most samples. The minimum BMI in the sample is 16 and the maximum is 53.1.
For the children variable, the mean and median are very close, but the distribution is highly uneven: about 43% of the insured have no children. Among those who do, most (24% of the sample) have 1 child, and the maximum in this sample is 5 children (1%).
The sample is very unevenly split between smokers and non-smokers (20% and 80%, respectively).
Regarding the region variable, the values are distributed fairly evenly, except for the southeast group, which is slightly larger (27%).
For the charges variable, we can see that the mean and median are very different from each other, showing a high skewness in the sample. From the boxplot, several outliers can be detected. The minimum value is 1121.9, and the maximum is 63770.4.
Through the ANOVA tests, we checked whether charges differed significantly across the levels of each variable. It was shown that, for example, being a smoker or being older increases the insurance charge, presumably because these people need more medical attention due to poorer health. The results of these tests agreed with those of the t-tests performed at the end.
When correlating the quantitative variables with charges, none of the correlations was strong; the largest, for age, was still below 0.5 (the threshold we used for a strong correlation), though relatively close to it. This was an expected result since, as mentioned before, older people need considerably more medical attention than younger patients.
Finally, the diagnostic plots from the linear regression analysis and the Shapiro-Wilk test show that the residuals are not normally distributed. Consistent with this, Cook’s distance and the residuals boxplot reveal several observations that strongly influence the results; these outliers may have a considerable impact on the correlations analysed earlier.
Besides the outliers affecting the final results, some variables, such as smoker and children, have highly skewed distributions, which also affects their influence in the analysis.
We verified that smoking is a significant indicator of higher insurance charges, driven by the need for greater medical attention, and we also observed that charges do not increase steadily with the number of children. From the boxplot of charges by number of children, having 3 children covered by insurance appears cheaper than having only 2, and having 5 children leads to a smaller price increase than having 4. This can be explained by the unequal group sizes: for example, only 18 patients have 5 children, while 324 have just 1.